feat(benchmark-harness): TF1/SF1 measurement infra for release validation (#320) by yfedoseev · Pull Request #361 · yfedoseev/pdf_oxide

yfedoseev · 2026-04-15T22:58:02Z

Summary

Closes #320. Release-verification infrastructure as a workspace crate at tools/benchmark-harness/. Computes TF1 (token F1) and SF1 (block-weighted structural F1 with LIS ordering) against ground-truth markdown, so "did this release improve extraction quality?" has an answer beyond gut feel and byte diffs.

The methodology mirrors Kreuzberg's benchmark-harness so numbers are comparable to their published reports.

What's in

- CLI: benchmark-harness run --engine <E> --corpus DIR --ground-truth DIR --output JSON and benchmark-harness diff BASE.json HEAD.json with a configurable regression gate (default: fail on mean TF1 drop > 0.5pp or per-fixture drop > 5pp).
- Engines: pdf_oxide (in-process), pdftotext (subprocess), pdfium (behind --features pdfium since the crate needs a prebuilt native lib). Adapter trait in src/engine.rs: one enum arm + one impl per new engine.
- Scoring: TF1 is a lowercase-alphanumeric bag-of-words F1 (src/score.rs); SF1 is pulldown-cmark block parser + type-compat matrix + greedy match + weighted P/R/F1 + LIS order penalty (src/sf1.rs). All formulas documented in PLAN.md and README.md.
- Consensus mode: --consensus-peers pdftotext,pdfium uses peer agreement as pseudo-ground-truth when no manual reference exists. Labels the report reference=consensus(...) so absolute quality and inter-engine agreement never get confused.
- Fixtures: scripts/fetch-fixtures.sh clones Kreuzberg's Apache-2.0 corpus (pinned via KREUZBERG_REF) and symlinks 154 PDFs + 180 ground-truth markdown files into fixtures/kreuzberg/. We don't vendor upstream PDFs directly because per-fixture licenses vary.
- Makefile targets: make benchmark-fetch, make benchmark-run ENGINE=<E> OUTPUT=<F>, make benchmark-compare BASE=<F> HEAD=<F>.
- 18 unit tests (5 TF1, 10 SF1, 3 consensus).
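
To make the TF1 definition above concrete, here is a minimal sketch of a lowercase-alphanumeric bag-of-words F1. It is an illustration only, not the code in src/score.rs: the names `tokenize`, `bag`, and `token_f1` are hypothetical, and scoring two empty inputs as 1.0 is an assumption.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it on every non-alphanumeric character.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Count how often each token occurs (the "bag" in bag-of-words).
fn bag(tokens: &[String]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for t in tokens {
        *counts.entry(t.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Token F1 between extracted text and reference text.
/// Assumption: two empty inputs count as a perfect match (1.0).
fn token_f1(extracted: &str, reference: &str) -> f64 {
    let ext = tokenize(extracted);
    let gt = tokenize(reference);
    if ext.is_empty() && gt.is_empty() {
        return 1.0;
    }
    if ext.is_empty() || gt.is_empty() {
        return 0.0;
    }
    let ext_bag = bag(&ext);
    let gt_bag = bag(&gt);
    // Multiset intersection: each token contributes min(count_ext, count_gt).
    let overlap: usize = ext_bag
        .iter()
        .map(|(tok, &n)| n.min(*gt_bag.get(tok).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / ext.len() as f64;
    let recall = overlap as f64 / gt.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under these assumptions, `token_f1("Hello, World!", "hello world")` is a perfect score because case and punctuation are discarded before counting, while an extraction that emits every reference token plus the same amount of noise is penalised on precision only.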

Why this matters

The 170-PDF byte-diff regression sweep we'd been using couldn't tell us how good extraction was — only that it didn't change. The harness immediately found a real bug (B1: shared Form XObject per-page CTM regression causing every page to return page 0's content) that byte-diff couldn't because both branches had the same bug. TF1 p10 moved from 0.776 → 0.849 (+7.3pp) once B1 was fixed.

Commit series

Nine phased commits so reviewers can see the scoring reveal itself:

Scaffold crate + CLI + TF1 + pdf_oxide adapter (faf51b2)
SF1 scorer with full per-block-type weight table (5d9c990)
Kreuzberg corpus layout + walkdir symlink follow (bf1eaef)
pdftotext/pdfium engine adapters + consensus mode + Makefile/README (8ec1dcb)
First real-corpus baseline + 4 bugs filed (3794409)
B1 fix measurement (99c6084)
Results after B1+B3+B4 combined (0dd0310)
Deep-dive on all remaining gaps (829d858)
Honest B4 findings (671cd6e)

Test plan

cargo test -p benchmark-harness — 18 tests pass
cargo clippy -p benchmark-harness --all-targets -- -D warnings — clean
End-to-end: make benchmark-fetch → make benchmark-run → make benchmark-compare validates six separate bug fixes (B1, B3, B4, B7, B8a, B9) with no per-fixture regressions > 0.5pp.

Follow-up bug fixes using this harness

Separate PRs against release/v0.3.31:

- #356 fix(b9): TrueType cmap format 0 parser for MS Office subset fonts (B9 TrueType cmap format 0)
- #357 fix(b7): dedup stroke+fill overlapping spans before merge (B7 stroke+fill dedup)
- #358 fix(b4): route multi-column pages through XY-cut reading order (B4 multi-column reading order)
- #359 fix(b3): keep first occurrence of running-header text (B3 running-artifact first-occurrence)
- #360 fix(b1): shared Form XObject per-page CTM, biggest single-fixture TF1 win (+64.7pp) (B1 shared Form XObject per-page CTM)

And the combined branch with all six fixes + this harness: fix/all-benchmark-bugfixes.

Adds `tools/benchmark-harness/` as a workspace crate. This is verification infrastructure, not a feature: without ground-truth scoring, "did this release improve extraction quality?" has no answer beyond gut feel and byte diffs.

Phase 1–2 in place:

- `tools/benchmark-harness/PLAN.md`: scoring formulas, 8-phase sequencing, risk register. Mirrors Kreuzberg's methodology so numbers are comparable across projects (#320's ask).
- `benchmark-harness run --engine pdf_oxide --corpus DIR --ground-truth DIR --output JSON`: extracts each PDF with the pdf_oxide in-process adapter, scores TF1 (bag-of-words F1 on lowercase alphanumeric tokens) against a matching .md file, and emits a JSON report with per-fixture + aggregate (mean, p50, lower-tail p90) metrics.
- `benchmark-harness diff BASE.json HEAD.json`: prints per-fixture regressions and exits non-zero when mean TF1 drops >0.5pp or any fixture drops >5pp. Thresholds are tunable flags.
- 5 unit tests on the tokenizer / F1 scorer (identical, disjoint, empty, partial, lowercase+punct stripping).

Later phases (SF1 block parser, pdftotext/pdfium adapters, consensus ground-truth fallback, vendored Kreuzberg fixtures, Makefile target) are tracked in PLAN.md and stubbed so the trait boundaries don't need to change later.
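
The regression gate described for `diff` can be sketched as below. This is a hedged illustration rather than the harness's implementation: `Gate`, `check_regressions`, the map-based input, and the choice to compare only fixtures present in both reports are all assumptions; only the two default thresholds (0.5pp on the mean, 5pp per fixture) come from the description above.

```rust
use std::collections::BTreeMap;

/// Gate thresholds in percentage points (hypothetical struct).
struct Gate {
    max_mean_drop_pp: f64,
    max_fixture_drop_pp: f64,
}

impl Default for Gate {
    fn default() -> Self {
        // Defaults quoted in the PR description.
        Gate { max_mean_drop_pp: 0.5, max_fixture_drop_pp: 5.0 }
    }
}

/// Compare per-fixture TF1 scores (0.0..=1.0) from two reports.
/// Returns one message per violation; an empty vector means the gate passes.
/// Assumption: only fixtures present in both reports are compared.
fn check_regressions(
    base: &BTreeMap<String, f64>,
    head: &BTreeMap<String, f64>,
    gate: &Gate,
) -> Vec<String> {
    let mut violations = Vec::new();
    let pairs: Vec<(&str, f64, f64)> = base
        .iter()
        .filter_map(|(name, &b)| head.get(name).map(|&h| (name.as_str(), b, h)))
        .collect();
    if pairs.is_empty() {
        return violations;
    }
    let n = pairs.len() as f64;
    let base_mean = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let head_mean = pairs.iter().map(|p| p.2).sum::<f64>() / n;
    // A score drop of 0.01 equals one percentage point.
    let mean_drop_pp = (base_mean - head_mean) * 100.0;
    if mean_drop_pp > gate.max_mean_drop_pp {
        violations.push(format!("mean TF1 dropped {:.2}pp", mean_drop_pp));
    }
    for (name, b, h) in &pairs {
        let drop_pp = (b - h) * 100.0;
        if drop_pp > gate.max_fixture_drop_pp {
            violations.push(format!("{} dropped {:.2}pp", name, drop_pp));
        }
    }
    violations
}
```

A CLI built on this shape would print the returned messages and exit non-zero whenever the vector is non-empty, which is what lets `make benchmark-compare` act as a release gate.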

Adds `tools/benchmark-harness/src/sf1.rs`: a block-weighted F1 implementation matching Kreuzberg's methodology, so SF1 numbers we publish are directly comparable.

Scoring pipeline:

- Parse markdown via pulldown-cmark (tables, math, GFM) into typed blocks: Heading(1..6), Paragraph, CodeBlock, Formula, Table, ListItem, Image. Math in a paragraph promotes it to Formula, so engines that emit `$\alpha$` inline still score as a formula block.
- Per-block weights: heading=2.0, code/formula/table=1.5, list=1.0, paragraph/image=0.5. Heading detection is the highest-signal layout decision; the weights reflect that.
- Type-compat matrix for cross-type allowances: heading↔heading by level distance (clamped ≥0.6), list↔paragraph=0.5, paragraph↔heading=0.25, code↔formula=0.3, code↔paragraph=0.2, table↔paragraph=0.25.
- Greedy matching on (content_tf1 × type_compat) with threshold 0.10 (0.20 for short blocks <5 tokens) and no-replacement assignment by descending score.
- Weighted precision/recall/F1 using the matched weights on both sides.
- Order score = LIS length of matched ext indices (sorted by gt index) / match count. 1.0 = perfectly preserved order; 0.5 = half the matches are out of place.

The per-fixture report gains sf1, sf1_precision, sf1_recall, order_score, matched_blocks. Aggregate gains sf1_mean/p50/p90 and order_mean. `diff` prints mean TF1, SF1, order deltas. Gate thresholds are still TF1-only for now (SF1 gating needs calibration on a real corpus first to avoid false positives from parser differences).

10 new unit tests cover block parsing (headings/paragraphs/code/tables), identical-input SF1=1, disjoint content SF1≈0, heading-level-mismatch partial compat, reversed-order order_score=0.5, LIS basics, weight taxonomy, and h1↔h2 / h1↔h6 compat values.
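
The order score in the pipeline above can be illustrated with a short sketch. The function names and the `(gt_index, ext_index)` pair representation are assumptions, as is returning 1.0 when there are no matches; the formula itself (LIS length of the extracted indices, sorted by ground-truth index, divided by the match count) is the one stated in the commit message.

```rust
/// Length of the longest strictly increasing subsequence,
/// computed in O(n log n) with the "smallest tail per length" method.
fn lis_len(seq: &[usize]) -> usize {
    // tails[k] holds the smallest possible last element of an
    // increasing subsequence of length k + 1 seen so far.
    let mut tails: Vec<usize> = Vec::new();
    for &x in seq {
        let pos = tails.partition_point(|&t| t < x);
        if pos == tails.len() {
            tails.push(x);
        } else {
            tails[pos] = x;
        }
    }
    tails.len()
}

/// Order score for matched blocks given as (gt_index, ext_index) pairs.
/// Assumption: with no matches there is nothing out of order, so return 1.0.
fn order_score(matches: &[(usize, usize)]) -> f64 {
    if matches.is_empty() {
        return 1.0;
    }
    let mut sorted = matches.to_vec();
    sorted.sort_by_key(|&(gt, _)| gt);
    let ext: Vec<usize> = sorted.iter().map(|&(_, e)| e).collect();
    lis_len(&ext) as f64 / ext.len() as f64
}
```

With two matched blocks emitted in swapped order the longest increasing run has length 1, so the score is 1/2 = 0.5, which lines up with the reversed-order unit test mentioned above.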

phases 4–8)

Finishes the benchmark harness. Phases 4–8 in one commit.

Engine adapters (phase 4)

- `pdftotext` subprocess adapter wrapping poppler's `pdftotext -layout`. Probes the binary once at startup so a missing install fails fast, not per fixture. Honours `PDFTOTEXT_BIN` for non-standard locations.
- `pdfium` adapter behind the `pdfium` feature (default off, since the crate needs a prebuilt native library). Uses `pdfium-render` and falls back between system library and `PDFIUM_DYNAMIC_LIB_PATH`.

Consensus-baseline ground truth (phase 5)

- `--consensus-peers pdftotext,pdfium` on `run` (mutually exclusive with `--ground-truth`). Per PDF, runs the peers, takes the token intersection of ≥N (default 2) peers, and scores the target engine against it. SF1 is skipped in consensus mode (needs block stream, not a token set) so numbers aren't misleading.
- Report gains a `reference` field: `"manual"` vs `"consensus(pdftotext,pdfium)"`. Prevents downstream readers from confusing inter-engine agreement with absolute quality.
- 3 unit tests on the consensus token set + scoring (min-agree, peers exceed threshold, partial overlap).

Fixtures (phase 6)

- `scripts/fetch-fixtures.sh`: clones Kreuzberg (pinned via `KREUZBERG_REF`, default `main`) into `.fixture-src/`, symlinks `tools/benchmark-harness/fixtures/kreuzberg → tools/benchmark-harness/fixtures` from the upstream. Re-runnable; idempotent. Don't vendor PDFs directly: per-fixture licenses inside Kreuzberg's corpus vary.

Makefile + README (phase 8)

- `make benchmark-fetch`: runs the fetch script
- `make benchmark-run`: `cargo run --release -p benchmark-harness -- run --engine $(ENGINE) …`
- `make benchmark-compare`: diff with regression gate
- README documents scoring formulas, invocation, engine matrix, JSON report schema, and license posture.

Tests: 18 total (5 TF1 + 10 SF1 + 3 consensus). Clippy clean under `-D warnings`.

Release branch build path unaffected: the crate is a new workspace member behind a cfg-less `cargo run -p benchmark-harness`.

Release-validation workflow this enables:

    git checkout main && make benchmark-run OUTPUT=base.json
    git checkout feat/X && make benchmark-run OUTPUT=head.json
    make benchmark-compare BASE=base.json HEAD=head.json

→ non-zero exit on meaningful TF1 regression, tuneable thresholds.
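
The consensus reference from phase 5 can be sketched as a voting step over peer outputs. This is one plausible reading of "token intersection of ≥N peers", not the harness's code: `consensus_tokens` is a hypothetical name, the tokenizer is assumed to be the same lowercase-alphanumeric split used for TF1, and each peer is assumed to vote at most once per token.

```rust
use std::collections::{BTreeSet, HashMap};

/// Lowercase the text and split it on every non-alphanumeric character.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Tokens that appear in the output of at least `min_agree` peers.
/// Assumption: a peer votes once per distinct token, however often it repeats.
fn consensus_tokens(peer_outputs: &[&str], min_agree: usize) -> BTreeSet<String> {
    let mut votes: HashMap<String, usize> = HashMap::new();
    for text in peer_outputs {
        // Deduplicate within one peer so repeats do not inflate the vote.
        let unique: BTreeSet<String> = tokenize(text).into_iter().collect();
        for tok in unique {
            *votes.entry(tok).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .filter(|(_, n)| *n >= min_agree)
        .map(|(tok, _)| tok)
        .collect()
}
```

Because the result is a token set rather than a block stream, it can stand in as a reference for token-level scoring only, which matches the note above that SF1 is skipped in consensus mode.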

Two bugs found by the first local run on the Kreuzberg corpus:

- Fetch script pointed DEST at the upstream's fixture *metadata* directory, but the PDFs and ground-truth markdown actually live under test_documents/{pdf,ground_truth/pdf}. Flatten both into ${DEST}/pdfs and ${DEST}/gt as symlinks so the harness's stem-matching loader just works.
- walkdir by default skips symlinks, so every stem-matched pair was invisible. Enable follow_links(true) on both walkers.
- Makefile CORPUS/GROUND_TRUTH point at the flattened subdirs.
- Add .gitignore for the upstream clone + generated symlink forest so re-running the fetch script never contaminates the working tree.

First numbers on the 102-pair intersection (TF1 mean):

    pdf_oxide : 0.919
    pdftotext : 0.946
    Δ         : -2.7pp

Detailed analysis follows in a separate artefact.

Running the harness end-to-end on Kreuzberg's 102-pair PDF corpus turned up real pdf_oxide bugs, which is the whole point. Captured the findings in BASELINE_ISSUES.md.

Headline numbers (engine vs pdftotext, TF1):

    mean  0.919 / 0.946  (Δ -2.7pp)
    p50   0.965 / 0.984  (Δ -1.9pp)
    p10   0.776 / 0.881  (Δ -10.5pp)  ← biggest gap on hard fixtures

Four issues identified, ranked by blast radius:

- B1: extract_text(n) returns identical content per page on some linearized PDFs (nougat_005.pdf: TF1 0.254 vs pdftotext 0.924). Page index appears to resolve to page 0 for every call.
- B2: empty-page false positives on text-heavy pages (pdfa_010 pages 2/9/11 return 0 bytes; pdftotext emits 400–2000 each).
- B3: running-artifact detector suppresses cover-page titles when they happen to overlap with per-page running headers (pdfa_010 loses "University of Oklahoma 2009"; same class as the 5PFVA6 case from the v0.3.31 sweep).
- B4: XY-cut reading-order loses content on multi-column / dashboard layouts (order_mean 0.80 vs 0.86, nougat_026, pdfa_001, etc.).

All four are existing pdf_oxide bugs that the 170-PDF byte diff couldn't catch (bytes matched across branches because both carry the bug). Now we have a verification pipeline with numbers.

Numbers on the Kreuzberg 102-fixture corpus with the B1 fix merged in:

    TF1 mean  0.919 → 0.925  (+0.64pp)
    TF1 p10   0.776 → 0.848  (+7.2pp)  ← hard-tail improvement
    SF1 mean  0.337 → 0.339  (+0.22pp)
    runtime   8.3 s → 5.7 s  (−31%)

Zero per-fixture regressions. The worst-in-corpus fixture nougat_005 moved from TF1 0.254 to 0.901, now essentially at parity with pdftotext's 0.924 on that file.

This validates the harness workflow end-to-end: harness found a bug, fix landed with TDD coverage, rerun quantifies the improvement, diff subcommand gates against any accidental regression.

Drop tools/.gitignore that came in from the fix branch: on the benchmark-harness branch the tools/benchmark-harness/ crate is the whole point and must stay tracked.

After merging B1 and B3 into the harness branch, the Kreuzberg 102-fixture benchmark shows:

    TF1 mean  0.919 → 0.927  (+0.77pp)
    TF1 p10   0.776 → 0.849  (+7.3pp)  ← hard tail
    SF1 mean  0.337 → 0.343  (+0.54pp)
    order     0.804 → 0.819  (+1.5pp)
    runtime   8.3s → 5.6s    (-33%)

Zero per-fixture regressions at either fix. Supersedes B1_RESULTS.md.

B2 closed as not-a-bug: post-B1 no fixture has pdf_oxide returning empty where pdftotext succeeds; pdfa_010's empty pages turned out to be genuinely empty in both tools.

B4 deferred: multi-column reading-order wants XY-cut promoted to default in extract_text, which is an architectural change with enough blast radius to warrant its own validation cycle. Tracked; nougat_026/pdfa_001 at order_score ~0.4 are the canaries for it.

XY-cut as default reading order for multi-column pages is correct (synthetic TDD test passes) but the Kreuzberg corpus aggregate shows neutral impact:

    TF1 mean  0.927 → 0.927  (+0.04pp)
    SF1 mean  0.343 → 0.342  (−0.09pp)
    order     0.819 → 0.817  (−0.19pp)

Per-fixture: ~6 wins (nougat_011/012, pdfa_048) at +5..+10pp, ~5 losses (nougat_033, pdfa_008, pdfa_037) at −2..−14pp, and a long tail of no-ops.

Interpretation captured in RESULTS.md: XY-cut is semantically right, but Kreuzberg's ground-truth markdown was generated from content-stream-order serialisers, so on single-column pages where content-stream ≈ row-aware, our fix loses SF1 points against a GT that's "less correct in the same way". This is exactly the kind of corpus-bias artefact the harness exists to surface: no amount of heuristic tightening will improve the aggregate without disabling the wins.

No per-fixture TF1 regression > 0.5pp; diff gate passes. Keeping the fix since the synthetic test proves correctness on clearly multi-column input; the real corpus-level improvement needs better GT.

yfedoseev added 11 commits April 15, 2026 07:29

Merge branch 'fix/b1-linearized-page-resolution' into feat/benchmark-harness (9ce983e)

Merge branch 'fix/b3-running-artifact-overreach' into feat/benchmark-harness (37d1421)

Merge branch 'fix/b4-reading-order-multi-column' into feat/benchmark-harness (f5c168d)

yfedoseev changed the base branch from release/v0.3.31 to main April 16, 2026 05:32

yfedoseev wants to merge 11 commits into main from feat/benchmark-harness
